PatientProfiler is an R package that allows personalized analysis of an entire cohort and specific individuals. The main functions can be used sequentially or separately.
If we want to imagine a sequential way to use PatientProfiler, we can divide the pipeline into 4 main steps:
STEP 1 - Raw data harmonization / Access to harmonized CPTAC data and Input tables preparation
STEP 2 - Protein activity inference
STEP 3 - Generation of mechanistic models
STEP 4 - Network-based stratification: communities detection
STEP 5 - Identification of biomarkers
To install PatientProfiler execute the following commands in R. A crucial prerequisite for PatientProfiler is SignalingProfiler package.
# Install SignalingProfiler
devtools::install_github('https://github.com/SaccoPerfettoLab/SignalingProfiler/')
# Install PatientProfiler
devtools::install_github('https://github.com/SaccoPerfettoLab/PatientProfiler/')PatientProfiler contains some functions executing Python source code To correctly set up the Python PatientProfiler environment please execute the following code line the first time you use PatientProfiler:
install_pp_py()library(PatientProfiler)
library(SignalingProfiler)
library(tidyverse)To use this R package, you need to have the right format for your input data frames.
The input has to have specific columns, as described here:
Transcriptomics:
gene_name - it indicates the names of genes/analytes
patients
Proteomics:
gene_name - it indicates the names of genes/analytes
patients
Phosphoproteomics:
gene_name - it indicates the names of genes/analytes
Site - it reperesents the phosphorylation site with aminoacid and position (e.g. T161)
Peptide - it contains the peptide sequence that includes the phosphorylated residue (optional)
patients
If the IDs of your
samples start with a number, it is necessary to modify the IDs (e.g.,
with an ‘X’ before the number).
This function allows to update UniProt IDs, add the sequence window, imputing NAs and calculating the Z-score.
To harmonize your raw omics data, you need to:
This step for big dataframes requires many hours of working mainly due the imputation. For this tutorial we are using a smaller amount of patients to give you a general vision of
###Example of usage:
transcriptomics_data <- read_tsv("./PatientProfiler_raw_input/Brca_transc.tsv")
proteomics_data <- read_tsv("./PatientProfiler_raw_input/Brca_prot.tsv")
phosphoproteomics_data <- read_tsv("./PatientProfiler_raw_input/Brca_phospho.tsv")
omics_update(
df_tr = NULL,
df_pr = NULL,
df_ph = phosphoproteomics_data,
threshold = 80,
sw_len = 7,
pep_col_name = "Peptide",
imp_method = "norm",
zscore = TRUE,
zmethod = "column",
metric = "median",
output_dir = 'PatientProfiler_processed_input'
)The result of this step is each selected dataframe manipulated according to the chosen parameters. If, for example, the phosphoproteomics update was made, it will have the information regarding phosphosites updated with the sequence_window column and Uniprot_ID added, it will be imputed, and it will have a distribution based on whether or not Z-score should be calculated.
After this step, tables are ready for the omics_preparation.
The next steps of PatientProfiler include wrapper functions to our in-house developed SignalingProfiler pipeline. As such, the harmonized data obtained with omics_update or access_harmonized_CPTAC_data should be transformed in SignalingProfiler-compliant format. To this aim, in PatientProfiler, we developed omics_preparation function.
For each uploaded omic type, a folder with a user-defined name will be created. Each folder will contain a file (data frame) for each patient, representing the needed omic input needed in SignalingProfiler-compliant format for the next steps.
tr_updated <- read_tsv("./vignette/PatientProfiler_processed_input/Transcriptomics_updated.tsv")
pr_updated <- read_tsv("./vignette/PatientProfiler_processed_input/Proteomics_updated.tsv")
ph_updated <- read_tsv("./vignette/PatientProfiler_processed_input/Phosphoproteomics_updated.tsv")
omics_preparation(
df_tr_updated = tr_updated,
df_pr_updated = pr_updated,
df_ph_updated = ph_updated,
transc_dir_name = "Transc_patients",
prot_dir_name = "Prot_patients",
phospho_dir_name = "Phospho_patients"
)This function is a wrapper to SignalingProfiler first step functions, wherein protein activity modulation is inferred from multi-omics data combining footprint-based techniques with PhosphoScore algorithm.
To extract protein activity for a single patient, you need to upload transcriptomics, proteomics, or phosphoproteomics data (or all of them) and set the parameters as described in the information part of the function (see the description by typing “?extract_protein_activity” in your console).
The parameters are managed as a list of default parameters that you can change as you wish, either inside or outside the function call.
(Note: The prot_df parameter in
tf_params and kin_params will only be used if a prot_file
is specified, since the function automatically loads proteomics data
only if correct_proteomics is TRUE and if it
finds the specified file)
tf_params <- list(reg_minsize = 5, collectri = TRUE)
kin_params <- list(reg_minsize = 5)
patient_id <- 'CPT000814'
extract_protein_activity(
prot_file = paste0("Prot_patients/Prot_Patient_", patient_id, ".tsv"),
trans_file = paste0("Transc_patients/Transc_Patient_", patient_id, ".tsv"),
phospho_file = paste0("Phospho_patients/Phospho_Patient_", patient_id, ".tsv"),
tf_params,
kin_params,
output_dir = "Activities"
)Results output
For each patient an .xlsx file
(Activity_patient_{patient_id}.tsv) will be saved inside
the output_dir directory (default, ‘Activities’), with
these columns:
UNIPROT: protein Uniprot ID.
gene_name: name of the protein.
mf: molecular functions.
final_score: activity score.
method: how final_score is calculated.
To perform the protein activity inference step for all the patients of a cohort, this function loops on each omic file for each patient and infer protein activity using [extract_protein_activity].
(Note: The prot_df parameter in
tf_params and kin_params will only be used if a prot_dir is
specified, since the function automatically loads proteomics data only
if correct_proteomics is TRUE and if it finds
files in the specified folder)
tf_params <- list(reg_minsize = 5, collectri = TRUE)
kin_params <- list(exp_sign = TRUE)
extract_cohort_activity(
prot_dir = "Prot_patients/",
trans_dir = "Transc_patients/",
phospho_dir = "Phospho_patients/",
tf_params,
kin_params,
output_dir = "Activities"
)Results output
The cohort final results will be saved as a set of .tsv files (1 file
- 1 patient=) inside the output_dir directory (default,
‘Activities’). Each patient-specific file
(Activity_patient_{patient_id}.tsv) contains these
columns:
UNIPROT: protein Uniprot ID.
gene_name: name of the protein.
mf: molecular functions.
final_score: activity score.
method: how final_score is calculated.
In this step, PatientProfiler creates for each patient a causal network connecting mutated genes to a user-defined set of cellular functional traits (e.g., APOPTOSIS, PROLIFERATION).
Importantly, this step requires a mutations file where rows are patients and columns are genes, and each cell is the impact on protein function of the mutation.
To this aim, we implemented a wrapper to SignalingProfiler network creation step
This function sequentially calls five functions to:
Select a PKN derived from SIGNOR and PhosphoSitePlus
(get_PKN);
Create a naive-network connecting mutations to inferred proteins
(create_naive_network);
Optimize the naive network over protein activity
(optimize_network_with_carnival);
Infer the activity of phenotypes from model proteins and connect
them in a proteins-to-phenotypes network
(infer_and_link_phenotypes);
Manipulate the proteins-to-phenotypes model for visualization and
functional circuits creation
(format_patient_networks).
Each step inside network creation requires a set of parameters,
extensively described in the functions documentation (type
?initialize_net_default_params” in your console. The following
function is called internally and automatically set up the default
parameters, as identified in SignalingProfiler benchmarking. However,
the user can customize the parameters for each step modifying
PKN_options, naive_options,
carnival_options, phenoscore_options,
format_options parameters list.
output_dir <- 'Networks_output'
network_params <- initialize_net_default_params(output_dir)
patient_id <- 'CPT000814'
# Read mutation file (starting nodes of the model)
mutations_df <- read_tsv('./PatientProfiler_raw_input/Brca_mutations.tsv', show_col_types = F)
sources <- mutations_df[mutations_df$Patient_ID == patient_id, ]
# Read activity file (intermediate nodes of the model)
activities <- read_tsv(paste0('./Activities/Activity_Patient_', patient_id, '.tsv'), show_col_types = F)
# Set the functional traits vector (final nodes of the model)
desired_phenotypes <- c('APOPTOSIS', 'PROLIFERATION')
# Read omics file
proteomics <- read_tsv(paste0('./Prot_patients/Prot_Patient_', patient_id, '.tsv'), , show_col_types = F)
transcriptomics <- read_tsv(paste0('./Transc_patients/Transc_Patient_', patient_id, '.tsv'), show_col_types = F)
phosphoproteomics <- read_tsv(paste0('./Phospho_patients/Phospho_Patient_', patient_id, '.tsv'), show_col_types = F)
# ProxPath proteins-to-phenotypes pre-processing
pheno_distances_table <- proxpath_preprocessing(proteomics = proteomics,
phosphoproteomics = phosphoproteomics)
create_network(patient_id = patient_id,
sources = sources,
activities = activities,
transcriptomics = transcriptomics,
proteomics = proteomics,
phosphoproteomics = phosphoproteomics,
desired_phenotypes = desired_phenotypes,
pheno_distances_table = pheno_distances_table,
output_dir = 'Networks_output',
save_all_files = FALSE,
PKN_options = list(direct = FALSE),
naive_options = list(layers = 2, max_length = c(1,4)),
carnival_options = list(carnival_type = 'vanilla_one_shot'),
phenoscore_options = list(), # keep default params
format_options = list(optimize_on_phenotypes = FALSE,
circuits_params = list(k = -1),
vis_cytoscape = FALSE))This function loops on the folder of omics and activity data of a cohort of patients creating a network of each of the patients from mutations to desired phenotypes.
For the files generated for each patient, refer to create_network function documentation.
create_cohort_networks(trans_dir = './Transc_patients/',
prot_dir = './Prot_patients/',
phospho_dir = './Phospho_patients/',
act_dir = './Activities/',
mut_file = './PatientProfiler_raw_input/Brca_mutations.tsv',
output_dir = './Networks_output/',
desired_phenotypes = c('APOPTOSIS', 'PROLIFERATION'),
pheno_distances_table = TRUE,
save_all_files = FALSE,
PKN_options = list(direct = FALSE),
naive_options = list(layers = 2, max_length = c(1,4)),
carnival_options = list(solver = 'cplex',
carnival_type = 'vanilla_one_shot'),
carnival_params = list(timelimit = 3600),
format_options = list(optimize_on_phenotypes = FALSE,
circuits_params = list(k = -1),
vis_cytoscape = FALSE)
)In this step, PatientProfiler extracts communities from the patient-specific mechanistic models generated in Step 3.
Importantly, this step requires a patient stratification file (patients_stratification.tsv), where each row represents a patient and the column indicates the subtype. You need to place this file in a designated directory (dir_path), which will be deleted after this step.
generate_communities(dir_path = "./Communities/input_communities",
network_dir = "./Networks_output",
output_dir = "./Communities/output_communities")Results output
The output_dir directory contains a subdirectory for each detected community. Each community directory includes:
A file for edges. A file for nodes. A file listing the patients belonging to the community.
Thanks to this function, we can extract signaling-driven transcriptomic signatures. To do this, you need the transcriptomics dataset obtained after Step 1 and the communities detected in Step 4.
extract_signatures(base_path = "./Communities/output_communities/",
transcriptomics_file = "./PatientProfiler_processed_input/Transcriptomics_updated.tsv",
output_dir = "./",
padj_thres = 0.01,
diff_thres = 0.5,
mean_exp_clus_thres = 0,
max_val = 50)
Results output
The Signatures directory contains a dataframe for each signature with the genes list.